# Image-Text Matching

| Model | Author | License | Tags | Downloads | Likes | Description |
|---|---|---|---|---|---|---|
| Sail Clip Hendrix 10epochs | cringgaard | – | Text-to-Image, Transformers | 49 | 0 | A vision-language model fine-tuned from openai/clip-vit-large-patch14 for 10 epochs. |
| Video Llava | AnasMohamed | – | Text-to-Image | 194 | 0 | A large-scale vision-language model based on the Vision Transformer architecture, supporting cross-modal understanding between images and text. |
| Vilt Finetuned 200 | Atul8827 | Apache-2.0 | Text-to-Image, Transformers | 35 | 0 | A vision-language model based on the ViLT architecture, fine-tuned for specific tasks. |
| Clip Vit Large Patch14 | Xenova | – | Text-to-Image, Transformers | 17.41k | 0 | OpenAI's open-source CLIP model, based on the Vision Transformer (ViT) architecture, supporting joint understanding of images and text. |
| CLIP Giga Config Fixed | Geonmo | MIT | Text-to-Image, Transformers | 109 | 1 | A large CLIP model trained on the LAION-2B dataset using the ViT-bigG-14 architecture, supporting cross-modal understanding between images and text. |
| Japanese Cloob Vit B 16 | rinna | Apache-2.0 | Text-to-Image, Transformers, Japanese | 229.51k | 12 | A Japanese CLOOB (Contrastive Leave-One-Out Boost) model trained by rinna Co., Ltd. for cross-modal understanding of images and text. |
| Clip Vit Large Patch14 336 | openai | – | Text-to-Image, Transformers | 5.9M | 241 | A large-scale vision-language pretrained model based on the Vision Transformer architecture, supporting cross-modal understanding between images and text. |
| Distilbert Base Turkish Cased Clip | mys | – | Text-to-Image, Transformers | 2,354 | 1 | A Turkish text encoder fine-tuned from dbmdz/distilbert-base-turkish-cased, designed to work with CLIP's ViT-B/32 image encoder. |
| Clip Vit B 32 Japanese V1 | sonoisa | – | Text-to-Image, Transformers, Japanese | 690 | 21 | A Japanese CLIP text/image encoder distilled from the English CLIP model. |
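
All of the CLIP-family models above score image-text matches the same way: each modality is embedded, the embeddings are L2-normalized, and cosine similarity (scaled by a temperature) is turned into a distribution over candidate captions. The sketch below illustrates that scoring step with random stand-in embeddings; in a real pipeline the vectors would come from one of the listed encoders (e.g. openai/clip-vit-large-patch14), and the `match_scores` helper, dimensions, and temperature value here are illustrative assumptions, not any specific model's API.

```python
import numpy as np

# Stand-in embeddings: in practice these would be produced by a CLIP image
# encoder and text encoder; sizes here are toy values for illustration.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(2, 8))  # 2 images, 8-dim embeddings
text_emb = rng.normal(size=(3, 8))   # 3 candidate captions

def match_scores(image_emb, text_emb, temperature=0.07):
    """CLIP-style matching: L2-normalize both sides, take cosine
    similarity scaled by a temperature, then softmax over the
    candidate texts for each image."""
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature            # shape (n_images, n_texts)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

probs = match_scores(image_emb, text_emb)
print(probs.shape)  # (2, 3): one distribution over captions per image
```

For retrieval in the other direction (text-to-image, as the tags above suggest), the same similarity matrix is simply softmaxed over images instead of texts.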